Question 1. Why can’t I just get data from a few different cities and run the regression of “Crime” on “Police” to understand how more cops in the streets affect crime? (“Crime” refers to some measure of crime rate and “Police” measures the number of cops in a city.)
Answer: A simple regression of “Crime” on “Police” suffers from an endogeneity circle: more police may reduce crime, but cities with more crime also tend to hire more police. If we use the simple linear regression model \(y=\beta_1+\beta_2 x+u\), where \(y\) is crime and \(x\) is police, the regressor is endogenous, i.e. \(E(xu)\neq 0\). In this circumstance OLS is inconsistent, because changes in \(x\) (police force) are associated not only with changes in \(y\) (crime) but also with changes in \(u\).
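To make the inconsistency concrete, here is a minimal simulation sketch. All numbers are hypothetical: we assume a true effect \(\beta_2=-1\) and let police respond to the unobserved crime shock \(u\). Under these assumptions the OLS slope converges to \(\beta_2+\mathrm{Cov}(x,u)/\mathrm{Var}(x)=-1+0.5/1.25=-0.6\) rather than to \(-1\), no matter how many cities we sample.

```python
import random


def ols_slope(xs, ys):
    """Slope of the simple regression of ys on xs: cov(x, y) / var(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


rng = random.Random(0)
TRUE_BETA2 = -1.0   # hypothetical true effect of police on crime
N_CITIES = 20000    # large sample, to show the bias does not vanish

police, crime = [], []
for _ in range(N_CITIES):
    u = rng.gauss(0, 1)              # unobserved crime shock
    x = 0.5 * u + rng.gauss(0, 1)    # police respond to the shock, so E(xu) != 0
    y = 10.0 + TRUE_BETA2 * x + u    # crime equation with the true slope
    police.append(x)
    crime.append(y)

estimated_beta2 = ols_slope(police, crime)
print(f"true beta2 = {TRUE_BETA2}, OLS estimate = {estimated_beta2:.3f}")
```

In this toy setup the estimate understates how much police reduce crime, which is the kind of distortion the endogeneity circle produces.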
Question 2. How were the researchers from UPenn able to isolate this effect? Briefly describe their approach and discuss their result in the “Table 2” below, from the researchers’ paper.
Answer: In this paper, the authors use a “high-alert periods” dummy variable to break the circle of endogeneity and estimate the effect of police on crime. They choose high-alert periods because the primary purpose of the Homeland Security Advisory System (HSAS) is to inform and coordinate the anti-terrorism efforts of all federal agencies, so the alert level directly affects the number of police deployed in specific districts for reasons unrelated to local crime. In addition, the authors use daily data so that the “treatment windows” are short. Their data also cover repeated terror alerts, which reduces the possibility of spurious correlation. Furthermore, the model includes a “Metro ridership” variable to check whether changes in tourism, rather than police, explain the change in crime. Finally, the authors use dummy variables for each day of the week to control for day effects.
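The specification described above can be written compactly. The following is only a sketch in our own notation; the symbol names are assumptions and are not taken from the paper:

\[\text{Crime}_t=\alpha+\beta\,\text{HighAlert}_t+\gamma\,\log(\text{MiddayRidership}_t)+\sum_{d}\delta_d\,\text{Day}_{d,t}+\varepsilon_t\]

Here \(t\) indexes days, \(\text{HighAlert}_t\) is the high-alert dummy, \(\text{Day}_{d,t}\) are the day-of-the-week dummies, and the coefficient of interest is \(\beta\). The ridership term is the tourism control.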
In column 1 of Table 2, the coefficient on the alert level is statistically significant at the 5 percent level and indicates that on high-alert days, total crimes decrease by an average of seven crimes per day, or approximately 6.6 percent. In column 2, the authors verify that high-alert levels are not being confounded with tourism levels by including logged midday Metro ridership directly in the regression. The coefficient on the alert level is slightly smaller, at -6.2 crimes per day. Increased Metro ridership is also correlated with an increase in crime. The increase, however, is very small: a 10 percent increase in Metro ridership increases the number of crimes by only 1.7 per day on average. Thus, given that midday Metro ridership is a good proxy for tourism, changes in the number of tourists cannot explain the systematic change in crime that the authors estimate.
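The figures quoted above can be cross-checked with a little arithmetic. The sketch below backs out two quantities that are not stated in the text: the average number of crimes per day implied by "seven crimes, or 6.6 percent", and the coefficient on log ridership implied by "1.7 crimes per 10 percent". Because the inputs are rounded, the implied values are rough approximations only.

```python
import math

# Column 1: seven fewer crimes per day is described as about 6.6 percent.
crimes_fewer_per_day = 7.0
percent_decline = 0.066
implied_baseline = crimes_fewer_per_day / percent_decline

# Column 2: with log ridership as a regressor, a 10 percent rise in ridership
# changes log ridership by ln(1.10), so the effect is coefficient * ln(1.10).
extra_crimes = 1.7
log_change = math.log(1.10)
implied_log_coefficient = extra_crimes / log_change

print(f"implied average crimes per day: {implied_baseline:.1f}")
print(f"implied coefficient on log ridership: {implied_log_coefficient:.1f}")
```

The implied baseline is roughly 106 crimes per day, and the implied ridership coefficient is roughly 17.8.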
Question 3. Why did they have to control for Metro ridership? What was that trying to capture?
Answer: They controlled for Metro ridership to test the alternative hypothesis that tourism is reduced on high-alert days, leaving fewer potential victims and therefore fewer crimes. Midday Metro ridership serves as a proxy for the number of tourists in the city.
Question 4. Below I am showing you “Table 4” from the researchers’ paper. Just focus on the first column of the table. Can you describe the model being estimated here? What is the conclusion?
Answer: The model in Table 4 includes district fixed effects to absorb the particular crime pattern of each district, and all regressions contain day-of-the-week fixed effects. The dependent variable is daily crime totals by district, and the effect of a high alert is estimated separately for District 1 and for the other districts. According to Table 4, during periods of high alert, crime in the National Mall area (District 1) decreases by 2.62 crimes per day. Crime also decreases in the other districts, by 0.571 crimes per day, but this effect is not statistically significant. Since District 1 averages 17.1 crimes per day, the decline during high-alert days is approximately 15 percent, and almost one-half of the total crime decline during high-alert periods is concentrated in District 1. In addition, the resulting elasticity of crime with respect to police is -0.3, which is consistent with other researchers’ results.
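The percentage decline and the elasticity quoted above are linked by simple arithmetic. The sketch below reproduces the 15 percent figure and then backs out the increase in police presence that an elasticity of -0.3 would imply. That implied increase is our own illustration; the answer above does not state it.

```python
# Figures quoted from Table 4 in the answer above.
district1_decline = 2.62     # fewer crimes per day in District 1 on high alert
district1_baseline = 17.1    # average crimes per day in District 1
elasticity = -0.3            # elasticity of crime with respect to police

# Percentage decline in District 1 crime on high-alert days.
percent_decline = district1_decline / district1_baseline

# elasticity = (% change in crime) / (% change in police), so the implied
# % change in police is (% change in crime) / elasticity.
implied_police_increase = (-percent_decline) / elasticity

print(f"crime decline in District 1: {percent_decline:.1%}")
print(f"implied increase in police presence: {implied_police_increase:.1%}")
```

Under these numbers, the elasticity corresponds to police presence rising by roughly one-half during high-alert periods.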
| CART | Random Forest | Boosting |
|---|---|---|
| 36.06927 | 31.39588 | 33.9174 |
The results suggest that the random forest model has the best performance on the testing data, with the lowest error of the three models.
What landlords and renters care about most is the revenue a building can generate per square foot per calendar year. Leasing revenue depends on many factors, such as building age, facilities, amenities, and building size. Nowadays, people also pay more attention to their environment when renting a property, including whether the building has a green certification. It is therefore worthwhile to study the relationship between rental income and green certification.
In this report, we build the best predictive model we can for revenue per square foot per calendar year, and we use this model to quantify the average change in rental income per square foot associated with green certification, holding other features of the building constant.
We first filter out the rows with missing data, which leaves our new greenbuildings data set with 7,820 observations.
The target variable is revenue per square foot per year, which is the product of two terms: rent and leasing_rate.
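As a minimal sketch of how the target is constructed, the snippet below multiplies the two terms for a couple of made-up buildings. The rows are hypothetical, and we assume leasing_rate is recorded in percent (0 to 100), which would be consistent with the revenue magnitudes reported later in this report; under that assumption, dividing by 100 would convert the result to rent actually collected per square foot.

```python
def revenue_per_sqft(rent, leasing_rate):
    """Target variable: rent multiplied by leasing_rate."""
    return rent * leasing_rate


# Hypothetical buildings, for illustration only.
toy_buildings = [
    {"rent": 30.0, "leasing_rate": 90.0},
    {"rent": 25.0, "leasing_rate": 80.0},
]

revenues = [revenue_per_sqft(b["rent"], b["leasing_rate"]) for b in toy_buildings]
print(revenues)
```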
The features used to build our model are:
(1) cluster: an identifier for the building cluster, with each cluster containing one green-certified building and at least one other non-green-certified building within a quarter-mile radius of the cluster center.
(2) size: the total square footage of available rental space in the building.
(3) empl.gr: the year-on-year growth rate in employment in the building's geographic region.
(4) stories: the height of the building in stories.
(5) age: the age of the building in years.
(6) renovated: whether the building has undergone substantial renovations during its lifetime.
(7) (8) class.a, class.b: indicators for two classes of building quality (the third is Class C). These are relative classifications within a specific market. Class A buildings are generally the highest-quality properties in a given market. Class B buildings are a notch down, but still of reasonable quality. Class C buildings are the least desirable properties in a given market.
(9) green.rating: an indicator for whether the building is either LEED- or EnergyStar-certified.
(10) net: an indicator as to whether the rent is quoted on a "net contract" basis. Tenants with net-rental contracts pay their own utility costs, which are otherwise included in the quoted rental price.
(11) amenities: an indicator of whether at least one of the following amenities is available on-site: bank, convenience store, dry cleaner, restaurant, retail shops, fitness center.
(12) cd.total.07: number of cooling degree days in the building's region in 2007. A degree day is a measure of demand for energy; higher values mean greater demand. Cooling degree days are measured relative to a baseline outdoor temperature, below which a building needs no cooling.
(13) hd.total07: number of heating degree days in the building's region in 2007. Heating degree days are also measured relative to a baseline outdoor temperature, above which a building needs no heating.
(14) total.dd.07: the total number of degree days (either heating or cooling) in the building's region in 2007.
(15) Precipitation: annual precipitation in inches in the building's geographic region.
(16) Gas.Costs: a measure of how much natural gas costs in the building's geographic region.
(17) Electricity.Costs: a measure of how much electricity costs in the building's geographic region.
(18) City_Market_Rent: a measure of average rent per square-foot per calendar year in the building's local market.
Table 3.1: Out-of-Sample Error by Model

| Random Forest | Boosting | KNN | LASSO |
|---|---|---|---|
| 838.2353 | 982.469 | 1000.042 | 1073.803 |
We first split the new greenbuildings data set into training and test sets. We then fit four different methods: random forest, boosted regression trees, KNN regression, and LASSO regression. Table 3.1 shows that the random forest model has the lowest out-of-sample error and is therefore the best predictive model. We thus use the random forest model to identify the association between rental revenue and green certification.
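The evaluation procedure can be sketched as follows. This is not the code used for the report: the 80/20 split, the seed, and the toy data are assumptions, and we assume the errors in Table 3.1 are root mean squared errors. The last lines simply pick the model with the smallest reported error.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data: 100 buildings with a made-up revenue formula.
rows = [{"size": s, "revenue": 2.0 * s + 5.0} for s in range(100)]
train, test = train_test_split(rows)

# Simplest possible baseline: predict the training-set mean for every test row.
mean_revenue = sum(r["revenue"] for r in train) / len(train)
baseline_rmse = rmse([r["revenue"] for r in test], [mean_revenue] * len(test))

# Errors reported in Table 3.1.
reported_errors = {
    "Random Forest": 838.2353,
    "Boosting": 982.469,
    "KNN": 1000.042,
    "LASSO": 1073.803,
}
best_model = min(reported_errors, key=reported_errors.get)
print(best_model, reported_errors[best_model])
```

Each candidate model would be fit on `train` only and scored with `rmse` on `test`, so that the comparison reflects out-of-sample performance.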
Figure 3.1: Figure of All Features Descending by Importance
From Figure 3.1, we find that the most important features for landlords to consider are size, stories, and age; these variables contribute the most to the model. By contrast, green_rating does not appear to be very important.
Figure 3.2: Impact of Green Certifications on Revenue
Figure 3.2 compares revenue by green_rating: the average revenue per square foot is around \(2693.526\) for buildings with a green certification and around \(2378.839\) for those without. Buildings with green certification therefore earn more revenue on average than those without, but the impact is small, and this raw comparison does not hold other building features fixed.
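For reference, the size of the raw gap between the two group averages can be computed directly from the figures above. This is only a descriptive comparison of means, not an estimate of the effect of certification.

```python
# Group averages quoted from Figure 3.2.
green_mean = 2693.526
non_green_mean = 2378.839

raw_gap = green_mean - non_green_mean       # absolute difference in averages
relative_gap = raw_gap / non_green_mean     # gap relative to non-green buildings

print(f"raw gap: {raw_gap:.3f}")
print(f"relative gap: {relative_gap:.1%}")
```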
Figure 3.3: Partial Effect of Green Certifications on Revenue
From Figure 3.3, we find a positive association between revenue and green certification: green certification is expected to increase revenue per square foot per year by \(65.37967\) on average, holding all else fixed.
In this analysis, we built a random forest model to estimate the association between green certification and revenue per square foot per year. We found that green certification has a slightly positive impact on revenue. More specifically, green certification is associated with a \(\$65.37967\) increase in revenue per square foot per year, holding all else fixed. In addition, we found that size, age, and stories matter more for revenue. Based on our research, landlords should weigh the pros and cons of getting a green certification, thinking critically about its cost and the resulting change in revenue. If they conclude that the benefit does not outweigh the cost of certification, they should focus on other features that have a larger impact on revenue.
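One common way to compute a partial effect like this from a fitted model is to predict every building twice, once with green_rating forced to 1 and once forced to 0, and average the difference. The sketch below illustrates that idea only. `toy_predict` is a hypothetical stand-in for the fitted random forest, and the buildings, the `class_a` key, and all coefficients are made up.

```python
def partial_effect(predict, rows, feature, low=0, high=1):
    """Average prediction with `feature` set to `high` minus set to `low`,
    leaving every other feature of each row unchanged."""
    def average_prediction(value):
        modified = [{**row, feature: value} for row in rows]
        return sum(predict(row) for row in modified) / len(modified)

    return average_prediction(high) - average_prediction(low)


def toy_predict(row):
    """Hypothetical model: NOT the report's random forest."""
    return (1000.0
            + 0.002 * row["size"]
            + 50.0 * row["green_rating"]
            + 20.0 * row["green_rating"] * row["class_a"])


# Hypothetical buildings, for illustration only.
toy_buildings = [
    {"size": 100000, "class_a": 1, "green_rating": 1},
    {"size": 250000, "class_a": 0, "green_rating": 0},
    {"size": 50000, "class_a": 1, "green_rating": 0},
    {"size": 400000, "class_a": 0, "green_rating": 0},
]

effect = partial_effect(toy_predict, toy_buildings, "green_rating")
print(f"average partial effect of green_rating: {effect:.2f}")
```

Because the other features stay at their observed values, the result is an average over the sample rather than the effect for any single building.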
Our purpose is to build the best predictive model to forecast the median house value in California. Following the exercise description, we first standardize totalRooms and totalBedrooms by creating the new variables sdrooms and sdbedrooms. We then use linear regression (with stepwise variable selection) and tree models to build the best predictive model.
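The text does not spell out how the standardization is done. One plausible reading, since both variables are totals, is to divide each by the number of households in the same record; a z-score would be another option. The sketch below shows the per-household version only, and both the `households` field and the example record are assumptions.

```python
def add_per_household_features(tract):
    """Return a copy of the record with per-household room and bedroom counts.

    Assumes the record has a `households` field (hypothetical here).
    """
    out = dict(tract)
    out["sdrooms"] = tract["totalRooms"] / tract["households"]
    out["sdbedrooms"] = tract["totalBedrooms"] / tract["households"]
    return out


# Hypothetical record, for illustration only.
toy_tract = {"totalRooms": 1000.0, "totalBedrooms": 250.0, "households": 200.0}
standardized = add_per_household_features(toy_tract)
print(standardized["sdrooms"], standardized["sdbedrooms"])
```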
Now, we list their RMSEs to find out which model is the best predictive model.
Table 5.1: RMSE by Model

| OLS | Stepwise | CART | Random Forest | Boosting |
|---|---|---|---|---|
| 73069 | 69779.6 | 63738.51 | 52701.98 | 73951.02 |
Table 5.1 shows that the random forest model has the lowest RMSE, so we employ it to build our best predictive model.
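The comparison in Table 5.1 can be summarized with a short calculation that ranks the models and expresses the random forest's RMSE relative to plain OLS. The values are copied from the table; the ranking code is just a convenience sketch.

```python
# RMSEs reported in Table 5.1.
rmse_by_model = {
    "OLS": 73069.0,
    "Stepwise": 69779.6,
    "CART": 63738.51,
    "Random Forest": 52701.98,
    "Boosting": 73951.02,
}

# Models ordered from lowest (best) to highest (worst) RMSE.
ranking = sorted(rmse_by_model, key=rmse_by_model.get)
best = ranking[0]

# Proportional reduction in RMSE of the best model relative to OLS.
reduction_vs_ols = 1 - rmse_by_model[best] / rmse_by_model["OLS"]

print(ranking)
print(f"{best} lowers RMSE by {reduction_vs_ols:.1%} relative to OLS")
```

Note that in this table the boosting model has a slightly higher RMSE than OLS.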
Figure 5.1: Actual Median House Values in California
Figure 5.2: Predicted Values in California
Figure 5.3: Model’s Errors
The figures above suggest that the model performs well, because the predicted values in Figure 5.2 closely match the actual values in Figure 5.1. We therefore conclude that the random forest model predicts well. In conclusion, our predictive model is effective, and we may be able to use it to forecast house values in other states or areas of the United States.